
Addition of the InitialDataAdder abstraction #383

Merged
merged 6 commits into from
Jul 11, 2023

Conversation

robertbartel
Contributor

A portion of a larger set of changes involved in moving to composite configuration dataset types. Breaking off fairly self-contained parts like this one for easier review.

Additions

  • A new InitialDataAdder abstract type and interface, used to add some initial data to a new dataset as it is being created

Changes

  • Modifying the DatasetManager interface so that its create function accepts an InitialDataAdder instead of a str for its initial_data param
  • Modifying the ObjectStoreManager implementation of DatasetManager so that its create function complies with the interface changes and uses the initial_data object as an InitialDataAdder appropriately
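As a rough sketch of the shape of the new abstraction, assuming only what the PR describes (the DictDataAdder subclass, its constructor parameters, and the dict-backed "storage" are hypothetical illustrations, not part of the actual dmod API):

```python
from abc import ABC, abstractmethod


class InitialDataAdder(ABC):
    """Encapsulates the logic for adding some initial data to a newly created dataset."""

    def __init__(self, dataset_name: str):
        self._dataset_name = dataset_name

    @abstractmethod
    def add_initial_data(self):
        """Add this instance's initial data to the dataset, raising on any failure."""
        ...


# Hypothetical concrete adder, for illustration only: "uploads" items
# into a plain dict standing in for a dataset's backing storage.
class DictDataAdder(InitialDataAdder):
    def __init__(self, dataset_name: str, items: dict, storage: dict):
        super().__init__(dataset_name)
        self._items = items
        self._storage = storage

    def add_initial_data(self):
        self._storage.update(self._items)
```

A manager's create function can then invoke add_initial_data() without needing to know where the data comes from or how it is obtained.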

Creating abstract InitialDataAdder type for use with creating datasets,
and modifying Dataset.create abstract function to use this type for the
'initial_data' parameter, rather than a string that represents the
location of some kind of initial data.
Updating DatasetManager implementation's create function to support
usage of the InitialDataAdder abstraction for the 'initial_data' param.
Updating to depend on core 0.9.0.
@robertbartel robertbartel added enhancement New feature or request maas MaaS Workstream labels Jul 10, 2023
@aaraney
Member

aaraney commented Jul 10, 2023

At first glance, it appears that this is adding responsibility to dataset creation that creation should not be responsible for. Do you think we could handle this use case if we instead introduced a dataset transaction or workflow abstraction? For starters, we could limit the functionality to operating on a single dataset. It seems easier to model data transformation pipelines using a workflow (weaker guarantees than a transaction) or a transaction.

@robertbartel
Contributor Author

At first glance, it appears that this is adding responsibility to dataset creation that creation should not be responsible for.

My perspective is actually the opposite. This is creating an abstract type dedicated to obtaining and adding initial data to a dataset. Regarding whether dataset creation should be tied to adding data ...

Do you think could handle this use case if we instead introduced a dataset transaction or workflow abstraction?

A transactional approach is essentially the intent here. If/when there is some expected initial data for a dataset, but it cannot be added, then DatasetManager.create() should (effectively) not create the dataset. For example, don't create a composite config dataset at all, even if we could/did add the realization config file to it just fine, if there is a failure copying files from the standalone BMI init config dataset we are supposed to copy from.

Of course, the manager must create that dataset in order to have something to add to. My solution was to ensure initial data is added inside create() by developing an abstraction, somewhat similar to a Visitor, that is passed to and used by create() and encapsulates the logic for adding any initial data. That way, if there is trouble, create() can catch any exceptions and clean up the bad dataset.
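The pattern described above might be sketched like this (a toy manager, not the real ObjectStoreDatasetManager; everything other than the create()/add_initial_data() interaction is illustrative):

```python
class ToyDatasetManager:
    """Minimal illustration of create() rolling back when initial data cannot be added."""

    def __init__(self):
        self.datasets = {}

    def create(self, name: str, initial_data=None):
        # The dataset must exist first, so there is something to add data to.
        self.datasets[name] = {}
        if initial_data is not None:
            try:
                # The adder encapsulates where the data comes from and how it is added.
                initial_data.add_initial_data()
            except Exception:
                # Initial data failed, so the dataset should (effectively) never exist.
                self.delete(name)
                raise
        return self.datasets[name]

    def delete(self, name: str):
        self.datasets.pop(name, None)
```

If add_initial_data() raises, the caller sees the exception, but no partially initialized dataset is left behind.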

This may make more sense after seeing some of the initial implementations. I just didn't include those here because they were in the dmod.dataservice package and not needed for this part.

@aaraney
Member

aaraney commented Jul 10, 2023

My perspective is actually the opposite. This is creating an abstract type dedicated to obtaining and adding initial data to a dataset. Regarding whether dataset creation should be tied to adding data ...

Right, create() is not responsible for the logic of uploading initial data, but it is responsible for making and handling the call. It is responsible for handling its own failures, and now for any failures when adding initial data. To be fair, I think I am overthinking and over-engineering this. In an ideal world, having a separate Transaction object that handles rollback logic akin to a db seems perfect for this problem. But given our storage layer and the problem at hand, implementing a transaction abstraction is probably overkill.

Member

@aaraney aaraney left a comment

It looks like a minor change is needed to clean up a bucket's objects before removing the bucket in the case of a failure.

# manager and the object store itself
except Exception as e:
    self.datasets.pop(name)
    self._client.remove_bucket(name)
Member

We will also need to remove any files that were possibly written to the bucket (i.e. if add_initial_data() successfully added a portion of its data). remove_bucket only removes empty buckets.

Fixing ObjectStoreDatasetManager create() function so that it calls the
delete() function if initial data fails to be added (and thus the
dataset should not end up created); the previous logic that just
removed the bucket would fail if the bucket had anything in it.
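The underlying constraint can be sketched with minio-style calls (the client below is a stub standing in for a real object store client, used only to illustrate why objects must be removed before the bucket; in the actual fix this cleanup lives in delete()):

```python
def remove_bucket_and_contents(client, bucket_name: str):
    """Remove every object in a bucket, then the bucket itself.

    remove_bucket() alone fails on a non-empty bucket, so any objects that
    add_initial_data() managed to write must be deleted first.
    """
    for obj in client.list_objects(bucket_name, recursive=True):
        client.remove_object(bucket_name, obj.object_name)
    client.remove_bucket(bucket_name)


# Minimal in-memory stub mimicking the relevant client calls, for illustration.
class _StubObject:
    def __init__(self, name):
        self.object_name = name


class StubClient:
    def __init__(self, buckets):
        self.buckets = buckets  # bucket name -> list of object names

    def list_objects(self, bucket_name, recursive=False):
        return [_StubObject(n) for n in list(self.buckets[bucket_name])]

    def remove_object(self, bucket_name, object_name):
        self.buckets[bucket_name].remove(object_name)

    def remove_bucket(self, bucket_name):
        if self.buckets[bucket_name]:
            raise RuntimeError("bucket not empty")
        del self.buckets[bucket_name]
```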
Member

@aaraney aaraney left a comment

Looks good to me! Thanks, @robertbartel!

@robertbartel robertbartel merged commit 16f17cd into NOAA-OWP:master Jul 11, 2023
1 check passed